SNOW-3485471: cte dedup for self-join case#4245
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4245 +/- ##
=======================================
Coverage 95.38% 95.39%
=======================================
Files 171 171
Lines 44243 44252 +9
Branches 7557 7559 +2
=======================================
+ Hits 42203 42215 +12
+ Misses 1252 1250 -2
+ Partials 788 787 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| # Hash by alias values only, not the UUID keys, since UUID keys are regenerated on every deep-copy/re-resolve (e.g. the two | ||
| # branches of a self-join). This lets nodes representing the same computation hash identically, enabling CTE dedup for self-joins. | ||
| # NOTE: since nodes with different UUID keys can now share a CTE, _replace_duplicate_node_with_cte must merge each duplicate's | ||
| # UUID→alias entries into the shared CTE so parent re-resolution can resolve any UUID variant (see companion comment there). | ||
| # Different alias values (e.g. a "_WITH_AD_GROUP" join suffix from _disambiguate) still hash differently, preserving SNOW-2261400. | ||
| string = f"{string}#{sorted(set(node.expr_to_alias.values()))}" |
There was a problem hiding this comment.
what kind of cases will this produce same hash-keys. When it is wrong, what kind of risk are we dealing with
There was a problem hiding this comment.
- I assume you are asking the hash-keys for a node -- same hash-keys are produced by nodes with same sql, sql params, df aliases, and expr_to_alias values (before the change it's by expr_to_alias.items().
(hash-keys for expr_to_alias are generated from uuid, it can rarely happen that two uuid collide)
- the only risk is parent column resolution as we relax the standard of repeated node identification. I introduced extra logic to detect conflicting expr_to_alias (same expr maps to two different alias) in nodes to prevent CTE in this case. thank you for bringing it up
There was a problem hiding this comment.
Is there any chance of duplicate value entries here that would silently get swallowed by the sorted(set(...)) call?
There was a problem hiding this comment.
Yes, duplicate alias can happen but consolidating them is harmless in our case, it's not reflected in the newly-generated CTE optimized node, and the generated sql in the hash is enough for dup node detection in the self-join case.
the expr_to_alias is mostly for internal column name resolution in the downstream compilation stage.
theoretically I think this info shall be excluded from the hash computation of a node, but right now I keep it as a defensive layer to distinguish nodes.
| # When adding a lsuffix, expr alias map will be updated, so df2 and df3 are considered | ||
| # different and have different ids. So only df1 and df will be converted to a CTE | ||
| # With value-sort hashing, df1/df2/df3 now hash identically (same SQL + | ||
| # same alias values, different UUID keys). df2 and df3 are replaced with a | ||
| # shared CTE, but df1's left-join position remains inline. That gives 2 | ||
| # CTEs (base VALUES + filtered df1) and the filter appears twice (once in | ||
| # the CTE body, once inline for the left-join position). | ||
| assert ( | ||
| count_number_of_ctes(Utils.normalize_sql(df6.queries["queries"][-1])) == 1 | ||
| count_number_of_ctes(Utils.normalize_sql(df6.queries["queries"][-1])) == 2 | ||
| ) | ||
| assert Utils.normalize_sql(df6.queries["queries"][-1]).count(filter_clause) == 3 | ||
| assert Utils.normalize_sql(df6.queries["queries"][-1]).count(filter_clause) == 2 |
There was a problem hiding this comment.
for clarity, more CTE is better for this case, see the below comparison between the generated sql:
Before (1 CTE, filter appears 3 times):
WITH CTE_values AS (
SELECT $1 AS "A", $2 AS "B" FROM VALUES (1,2),(3,4)
)
SELECT * FROM (
(
SELECT "A_XXX", "B_XXX", "A" AS "A_YYY", "B" AS "B_YYY"
FROM (
(
SELECT "A" AS "A_XXX", "B" AS "B_XXX"
FROM CTE_values WHERE ("A" = 1) -- filter #1 (df1, inline)
) AS SNOWPARK_LEFT
INNER JOIN (
SELECT "A", "B"
FROM CTE_values WHERE ("A" = 1) -- filter #2 (df2, inline)
) AS SNOWPARK_RIGHT
)
) AS SNOWPARK_LEFT
INNER JOIN (
SELECT "A", "B"
FROM CTE_values WHERE ("A" = 1) -- filter #3 (df3, inline)
) AS SNOWPARK_RIGHT
)
After (2 CTEs, filter appears 2 times):
WITH CTE_values AS (
SELECT $1 AS "A", $2 AS "B" FROM VALUES (1,2),(3,4)
),
CTE_filtered AS ( -- df2 and df3 deduplicated here
SELECT "A", "B"
FROM CTE_values WHERE ("A" = 1) -- filter #1 (in CTE body)
)
SELECT * FROM (
(
SELECT "A_XXX", "B_XXX", "A" AS "A_YYY", "B" AS "B_YYY"
FROM (
(
SELECT "A" AS "A_XXX", "B" AS "B_XXX"
FROM CTE_values WHERE ("A" = 1) -- filter #2 (df1, still inline)
) AS SNOWPARK_LEFT
INNER JOIN (SELECT * FROM CTE_filtered) AS SNOWPARK_RIGHT
)
) AS SNOWPARK_LEFT
INNER JOIN (SELECT * FROM CTE_filtered) AS SNOWPARK_RIGHT
)
| # Hash by alias values only, not the UUID keys, since UUID keys are regenerated on every deep-copy/re-resolve (e.g. the two | ||
| # branches of a self-join). This lets nodes representing the same computation hash identically, enabling CTE dedup for self-joins. | ||
| # NOTE: since nodes with different UUID keys can now share a CTE, _replace_duplicate_node_with_cte must merge each duplicate's | ||
| # UUID→alias entries into the shared CTE so parent re-resolution can resolve any UUID variant (see companion comment there). | ||
| # Different alias values (e.g. a "_WITH_AD_GROUP" join suffix from _disambiguate) still hash differently, preserving SNOW-2261400. | ||
| string = f"{string}#{sorted(set(node.expr_to_alias.values()))}" |
There was a problem hiding this comment.
Is there any chance of duplicate value entries here that would silently get swallowed by the sorted(set(...)) call?
d854f34 to
5172f7f
Compare
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
In SCOS, spark.table() translates into SnowflakePlan nodes with non-empty expr_to_alias. The TPCDS_Q39 query aliases the same inventory DataFrame as inv1 and inv2 and self-joins them. The join path
deep-copies the plan for each branch and calls add_aliases separately, resulting in both branches carrying the same alias values under different UUID keys. The old key-sort hashing produced
different hashes for the two branches, so the CTE optimizer did not detect them as duplicates and inlined the full aggregation pipeline twice. The value-sort fix correctly identifies them as
the same computation and emits a single CTE.
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.